For this series of lectures, we will be using the famous Iris flower data set.
The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Sir Ronald Fisher in 1936 as an example of discriminant analysis.
The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor), so 150 total samples. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters.
Here's a picture of the three different Iris types:
# The Iris Setosa
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg'
Image(url,width=300, height=300)
# The Iris Versicolor
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/4/41/Iris_versicolor_3.jpg'
Image(url,width=300, height=300)
# The Iris Virginica
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/9/9f/Iris_virginica.jpg'
Image(url,width=300, height=300)
The iris dataset contains measurements for 150 iris flowers from three different species.
The three classes in the Iris dataset:
Iris-setosa (n=50)
Iris-versicolor (n=50)
Iris-virginica (n=50)
The four features of the Iris dataset:
sepal length in cm
sepal width in cm
petal length in cm
petal width in cm
Use seaborn to get the iris data by using: iris = sns.load_dataset('iris')
import seaborn as sns
iris=sns.load_dataset('iris')
Let's visualize the data and get you started!
Create a pairplot of the data set. Which flower species seems to be the most separable?
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
iris.head()
Create a KDE plot of sepal_length versus sepal_width for the setosa species of flower.
sns.pairplot(iris,hue='species',palette='Dark2')
# As we can see, setosa is the most separable species.
# Let's draw a KDE plot for setosa to explore it further.
setosa = iris[iris['species']=='setosa']
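# Keyword x=/y= and fill= are the current seaborn spelling (assumes seaborn >= 0.11).
# The default thresh leaves the lowest density level unshaded, like shade_lowest=False.
sns.kdeplot(x=setosa['sepal_width'], y=setosa['sepal_length'],
            cmap="plasma", fill=True)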
Split your data into a training set and a testing set.
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead.
from sklearn.model_selection import train_test_split
X=iris.drop('species', axis=1)
y=iris['species']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
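For intuition about what the split does, here is a minimal pure-Python sketch of a shuffled hold-out split. The helper name `simple_train_test_split` and its `seed` parameter are hypothetical. scikit-learn's `train_test_split` supports more options (such as stratification), and its shuffle will not match this one row for row.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.3, seed=101):
    """Shuffle row indices and hold out a fraction for testing (illustrative only)."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    n_test = math.ceil(test_size * len(rows))  # size of the held-out set
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test


# 150 placeholder row ids stand in for the 150 iris samples.
train_rows, test_rows = simple_train_test_split(list(range(150)))
print(len(train_rows), len(test_rows))  # 105 45
```

With `test_size=0.3`, 45 of the 150 samples are held out for testing and the remaining 105 are used for training.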
Now it's time to train a Support Vector Machine classifier.
Call the SVC() model from sklearn and fit the model to the training data.
from sklearn.svm import SVC
svc_model=SVC()
svc_model.fit(X_train,y_train)
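`SVC()` called with no arguments uses a radial basis function (RBF) kernel, which scores the similarity of two samples from their squared Euclidean distance. The sketch below computes that kernel value by hand. The function name `rbf_kernel_value`, the `gamma` value, and the two measurement vectors are illustrative assumptions, not values taken from the fitted model.

```python
import math


def rbf_kernel_value(x, z, gamma=0.5):
    """Return exp(-gamma * ||x - z||^2) for two equal-length feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


# Two made-up flowers: sepal length, sepal width, petal length, petal width (cm).
flower_a = [5.0, 3.5, 1.5, 0.2]
flower_b = [6.5, 3.0, 5.5, 2.0]

print(rbf_kernel_value(flower_a, flower_a))  # identical points give 1.0
print(rbf_kernel_value(flower_a, flower_b))  # distant points give a value near 0
```

Samples that sit close together in feature space get a kernel value near 1, and samples that sit far apart get a value near 0, which is what lets the classifier draw curved boundaries between species.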
Now get predictions from the model and create a confusion matrix and a classification report.
prediction=svc_model.predict(X_test)
from sklearn.metrics import classification_report,confusion_matrix
print(confusion_matrix(y_test,prediction))
print(classification_report(y_test,prediction))
Wow! The model performed pretty well!
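To see what a confusion matrix and the per-class precision and recall in a classification report summarize, the sketch below builds them by hand. The labels in `y_true` and `y_pred` are hypothetical and are not output from the model above. Rows are true classes and columns are predicted classes, the same layout `confusion_matrix` uses.

```python
from collections import Counter

# Hypothetical true and predicted labels, not output from the model above.
y_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
y_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

labels = sorted(set(y_true) | set(y_pred))
pair_counts = Counter(zip(y_true, y_pred))

# Rows are true classes, columns are predicted classes.
matrix = [[pair_counts[(t, p)] for p in labels] for t in labels]
print(matrix)

for i, label in enumerate(labels):
    true_positives = matrix[i][i]
    predicted_total = sum(row[i] for row in matrix)  # column sum
    actual_total = sum(matrix[i])                    # row sum
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / actual_total if actual_total else 0.0
    print(f"{label:<10} precision={precision:.2f} recall={recall:.2f}")
```

In this toy case one versicolor sample is predicted as virginica, so versicolor loses recall and virginica loses precision, while setosa is classified perfectly.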